In [27]:
%config InlineBackend.figure_format = 'retina'

Netflix Null Values¶



The problem of missing values in data is a common issue that can arise during the data collection and preprocessing stages. Missing values can occur for a variety of reasons, such as errors in data entry or measurement, or the intentional withholding of information. These missing values can have a significant impact on the quality and accuracy of any analysis performed on the data.

One of the most important steps in dealing with missing data is to handle it appropriately before any analysis is performed. This is where imputation comes in. Imputation is the process of replacing missing values with estimated ones. The goal of imputation is to minimize the loss of information and bias in the data while maximizing the usefulness of the remaining information.

It is important to address missing values before any analysis because they can lead to biased or incorrect results. For example, if the missing data is not handled properly, it can lead to a biased estimate of the mean or standard deviation. Additionally, many statistical models and machine learning algorithms cannot handle missing data and will either fail or produce inaccurate results.

Overall, imputing missing values is crucial to ensure the validity and accuracy of any analysis performed on the data.

  • Missing values can occur for a variety of reasons, such as errors in data entry or measurement, or the intentional withholding of information.

  • Imputation is the process of replacing missing values with estimated ones.

  • The goal of imputation is to minimize the loss of information and bias in the data while maximizing the usefulness of the remaining information.
  • Handling missing values before analysis is crucial to ensure the validity and accuracy of the results.
  • Many statistical models and machine learning algorithms cannot handle missing data and will either fail or produce inaccurate results.
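As a quick illustration of the bias point above (toy data, not the Netflix set), pandas silently skips missing values, and a careless imputation can shift a summary statistic just as badly as ignoring the problem:

```python
import pandas as pd
import numpy as np

# Toy series with two missing observations
s = pd.Series([10.0, 12.0, np.nan, 11.0, np.nan])

# pandas skips NaN, so the mean is computed from only 3 of 5 values
print(s.mean())            # 11.0

# Naively imputing with 0 drags the mean down
print(s.fillna(0).mean())  # 6.6

# Imputing with the observed mean preserves the estimate
print(s.fillna(s.mean()).mean())  # 11.0
```

Which behavior is "correct" depends on why the values are missing; the point is that the choice of imputation strategy visibly changes the statistics downstream.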

Goal¶

The goal of this notebook is to perform an in-depth analysis of the imputation of missing values using both Spark and Pandas, two widely used data processing libraries. The analysis will include a comparison of the different methods available for imputation in each library, and the robustness of each method will be evaluated using appropriate statistical measures. The notebook will also demonstrate the implementation of these methods in a real-world scenario.

The specific objectives of this notebook are:

  1. To introduce the problem of missing values in data and explain why it is important to address it before any analysis is performed.

  2. To present an overview of the different methods of imputation available in Spark and Pandas.

  3. To evaluate the robustness of each imputation method using appropriate statistical measures.

  4. To demonstrate the implementation of the imputation methods in a real-world scenario.

  5. To compare the results obtained from the methods in Spark and Pandas and draw conclusions about which library is more suitable for imputing missing values in a given context.

Overall, this notebook aims to provide a comprehensive understanding of the imputation of missing values in data using Spark and Pandas, and help data scientists and analysts make informed decisions when dealing with missing data.

Install Libraries¶

During the process of handling missing data, it is important to install the necessary libraries that may not be present in Databricks. Some of the libraries needed for this analysis are not available in the current environment, so a description and the key aspects of each one are included as they are added, to ensure proper use and understanding. This also ensures that all necessary functions and tools are available for the subsequent steps of the notebook.

pyspark¶

PySpark is the Python API for Apache Spark, a fast and general engine for large-scale data processing. It is designed to provide a simple and easy-to-use programming interface for parallel computing on clusters of machines. PySpark is built on top of Spark's Java API and exposes the Spark programming model to Python.

In [61]:
#!pip install pyspark

wordcloud¶

The wordcloud library will be used in this notebook to create visual representations of the data, specifically to compare the missing values using pandas and Spark. This library allows for the creation of word clouds from text data, where the size of each word represents its frequency in the text. This can be useful for identifying patterns and trends in the missing values, and can also be used for exploratory data analysis.

In [62]:
#!pip install wordcloud

Importing Data¶


Now, we will be importing the data using both Spark and Pandas. It is important to note that both of these libraries have their own unique advantages and disadvantages when it comes to handling large datasets. Spark is highly efficient and can handle large amounts of data with ease, but it can be more difficult to work with than Pandas. On the other hand, Pandas is relatively easy to use and is great for data manipulation and cleaning, but it may struggle with very large datasets. Therefore, it is important to consider the size and complexity of the dataset when deciding which library to use for data import. Additionally, it is also important to keep in mind that the specific use case and desired outcome of the analysis may also play a role in determining which library is best suited for the task at hand.

Spark¶

PySpark is the Python library for Spark programming that allows for easy and efficient processing of large amounts of data using the power of the Apache Spark engine. One of the main advantages of using PySpark in Databricks is its ability to scale up and distribute computations across multiple machines, which can greatly improve the performance of big data processing tasks. One important thing to consider is that in the community edition of Databricks a local CSV cannot simply be read with pandas' pd.read_csv, so the code to import data tends to be longer than with the Pandas library.

In [3]:
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql import SparkSession
#Create Spark Session
spark = SparkSession.builder.getOrCreate()

# File location and type
file_location = "netflix_titles.csv"

# Define the schema
schema = StructType([
    StructField("show_id", IntegerType()),
    StructField("type", StringType()),
    StructField("title", StringType()),
    StructField("director", StringType()),
    StructField("cast", StringType()),
    StructField("country", StringType()),
    StructField("date_added", DateType()),
    StructField("release_year", IntegerType()),
    StructField("rating", StringType()),
    StructField("duration", StringType()),
    StructField("listed_in", StringType())
])

# The applied options are for CSV files. For other file types, these will be ignored.
spark_data = spark.read.format('csv') \
                  .option("inferSchema", 'false') \
                  .option("header", 'true') \
                  .option("dateFormat", "MMMM d, yyyy")\
                  .schema(schema) \
                  .load(file_location)

.option() is a method provided by the Spark DataFrame API that allows you to specify options when reading in data. These options can include things like the file format (e.g. CSV), whether to infer the schema automatically, and how to handle column names. The .option() method takes two arguments: the first is the name of the option, and the second is the value for that option. The options specified with .option() will be used when the .load() method is called to read in the data. Some typical options are:

  • inferSchema: By default, Spark will try to infer the schema of the data automatically when reading in a CSV file. This can be slow and can also result in unexpected data types. Setting this option to 'false' will prevent Spark from trying to infer the schema and instead use the schema that is specified in the next option, 'schema'.

  • header: By default, Spark assumes that the first row of the CSV contains the column names. Setting this option to 'true' will tell Spark to use the first row as the column names. If this is set to 'false', Spark will not use the first row as the column names and will instead generate default column names like "col1", "col2", etc.

  • dateFormat: By default, Spark reads dates in the format "yyyy-MM-dd"; this option lets you specify a different format for your date fields.

  • schema: This option allows you to specify the schema of the data explicitly. This can be useful if you know the schema ahead of time and want to avoid the overhead of inferring it automatically.

  • load: strictly speaking this is a method rather than an option; it takes the path of the CSV file you want to import and triggers the actual read.

Using these options when importing a CSV file in Spark can help ensure that your data is read in correctly and with the desired schema, and can improve performance by avoiding the overhead of inferring the schema automatically.

In [4]:
import plotly.express as px
import matplotlib.pyplot as plt
In [5]:
from itables import init_notebook_mode
import itables.options as opt
init_notebook_mode(all_interactive=True)
In [6]:
opt.column_filters= 'footer'
opt.classes = 'display nowrap cell-border'
opt.dom = 'lftipr'
opt.search = {"regex": True, "caseInsensitive": True, "smart":True, 'highlight':True}
opt.maxBytes = spark_data.toPandas().memory_usage().sum()
opt.paging = True
opt.autoWidth=False
opt.showIndex = False
opt.columnDefs=[{"width": "120px", "targets": "_all"}]
In [7]:
display(spark_data.toPandas())

Why is it important to use options?¶

By using the options in the import statement, we ensure that the dataframe is in the correct format according to the schema specified. For example, by setting the "inferSchema" option to 'false' we are telling Spark to not try to infer the schema automatically and instead use the schema specified in the "schema" option. This ensures that the data is correctly mapped to the desired column types.

The "dateFormat" option is important because it allows you to specify the format of the date field in the CSV file. Without this option, Spark assumes the default format "yyyy-MM-dd", and any date in the CSV that does not match is converted to null. In this case the 'date_added' column holds values like 'September 9, 2019', so the correct format string is 'MMMM d, yyyy'. By specifying the correct date format, Spark will correctly parse the date field and map it to the correct column in the dataframe.

Overall, these options allow us to have more control over the format of the dataframe and make sure that it is in the correct format according to the schema.
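The same "mismatched format becomes null" behavior is easy to illustrate in pandas (a toy sketch, not the notebook's data): pd.to_datetime with errors='coerce' turns unparseable values into NaT, analogous to Spark nulling out dates that do not match dateFormat.

```python
import pandas as pd

raw = pd.Series(["September 9, 2019", "January 15, 2020"])

# Correct format: full month name, day, year
ok = pd.to_datetime(raw, format="%B %d, %Y")
print(ok.isna().sum())   # 0

# Wrong format: every value fails to parse and becomes NaT (null),
# mirroring how Spark nulls out dates that don't match dateFormat
bad = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")
print(bad.isna().sum())  # 2
```

This is why an unexpectedly high null count in a date column is often a parsing problem rather than genuinely missing data.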

Pandas¶

Using pandas for data import is a simpler process compared to PySpark. The read_csv() function allows for easy import of csv files, even when they are stored in a public repository and accessed via a URL. One major advantage of using pandas is that it does not require the creation of a schema for the variables. This can save time and effort when working with large datasets. However, it's important to keep in mind that Pandas is not as efficient as PySpark when working with big data. So, if you are dealing with large datasets, PySpark is the best option.

In [10]:
import pandas as pd

url = 'netflix_titles.csv'
pandas_data = pd.read_csv(url, parse_dates=['date_added', 'release_year'])
In [18]:
pandas_data['date_added'] = pd.to_datetime(pandas_data.date_added, format='%B %d, %Y').dt.date
pandas_data['release_year'] = pd.to_datetime(pandas_data.release_year).dt.year

date type¶

It is important to have the correct date format in pandas because it allows for accurate manipulation and analysis of the data. Inaccurate date formats can lead to errors or misinterpretation of the data. By using the pd.to_datetime() function on the 'date_added' column, we ensure that it is correctly formatted as a date, which allows for accurate analysis based on specific dates. Similarly, converting the 'release_year' column with pd.to_datetime() and then extracting only the year component ensures that it is stored as an integer, which allows for accurate analysis based on specific years.
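A small sketch (toy values, same format as the notebook's 'date_added' column) of the two .dt accessors used above:

```python
import pandas as pd

dates = pd.Series(["September 9, 2019", "March 2, 2021"])
parsed = pd.to_datetime(dates, format="%B %d, %Y")

# .dt.date keeps only the calendar date (a Python date object)
print(parsed.dt.date.iloc[0])   # 2019-09-09

# .dt.year extracts the year as an integer, handy for grouping by year
print(parsed.dt.year.tolist())  # [2019, 2021]
```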

In [19]:
opt.maxBytes = pandas_data.memory_usage().sum()
pandas_data
Out[19]:
In [20]:
pandas_data.columns
Out[20]:
Index(['show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added',
       'release_year', 'rating', 'duration', 'listed_in', 'description'],
      dtype='object')

dates format¶

When reading in a data frame, it is important to make sure that the data is in the correct format for analysis. This is particularly important for columns that represent dates and times, such as the 'date_added' and 'release_year' columns.

In the case of the 'date_added' column, it is formatted as '2019-09-09'. This is a standard format for dates, with the year, month, and day separated by dashes. This format is easily readable and can be easily used for temporal analysis.

The 'release_year' column is also in the correct format, with the year formatted as an integer. This format is useful when performing statistical analysis by year. It is important that the column is in this format so that it can be used for groupby and aggregation operations.
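For instance, with the year stored as an integer, a groupby-and-count by year is straightforward (toy frame, not the Netflix data):

```python
import pandas as pd

# Toy frame: integer years allow direct groupby/aggregation
df = pd.DataFrame({
    "release_year": [2019, 2019, 2020, 2021, 2021, 2021],
    "title": list("abcdef"),
})

# Count titles per release year
counts = df.groupby("release_year")["title"].count()
print(counts.to_dict())  # {2019: 2, 2020: 1, 2021: 3}
```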

Exploratory Data¶


Data info¶

Spark¶

The printSchema() method is a useful tool when working with DataFrames in Spark as it allows you to quickly view the data types and names of columns in a DataFrame. This method displays the schema of the DataFrame in a tree format, with each level of the tree representing a level of nested fields in the schema. By using this method, you can easily identify any issues or errors in the schema, such as incorrect data types. Additionally, it can help you understand the structure of your data and make any necessary adjustments before performing further analysis or operations on the DataFrame.

In [21]:
spark_data.printSchema()
root
 |-- show_id: integer (nullable = true)
 |-- type: string (nullable = true)
 |-- title: string (nullable = true)
 |-- director: string (nullable = true)
 |-- cast: string (nullable = true)
 |-- country: string (nullable = true)
 |-- date_added: date (nullable = true)
 |-- release_year: integer (nullable = true)
 |-- rating: string (nullable = true)
 |-- duration: string (nullable = true)
 |-- listed_in: string (nullable = true)

According to the schema provided, there are certain adjustments that need to be made to improve the data. In this case, the 'show_id' column is read as an integer; however, since it is an identifier rather than a measure, it should be treated as a string (the equivalent of pandas' object dtype) to avoid including it in statistics.

When a column is defined as an integer, it is considered as a numeric column and it can be used in mathematical operations such as sum, average, standard deviation, etc. However, the 'show_id' column is not a numeric value, it is a unique identifier for each show, so it should not be included in mathematical operations and it should not be used as a measure in any statistical analysis.

Therefore, by converting the 'show_id' column from integer to object, we ensure that this column is treated as a string or categorical data, which is the correct data type for this column. This will prevent any errors or inaccuracies in the analysis of the data and will make the data more meaningful.

In summary, converting the 'show_id' column from integer to object is important to maintain the integrity of the data, and to prevent any errors or inaccuracies in the analysis of the data.

Pandas¶

The info() method in the Python library pandas is used to get a summary of the dataframe, including the name of the columns, their data types, and the number of non-null values.

When you use the info() method on a dataframe, it will display the following information:

  • The name of the dataframe
  • The number of rows and columns
  • The name of each column and its data type
  • The number of non-null values for each column
  • The memory usage of the dataframe
In [22]:
pandas_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6234 entries, 0 to 6233
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       6234 non-null   int64 
 1   type          6234 non-null   object
 2   title         6234 non-null   object
 3   director      4265 non-null   object
 4   cast          5664 non-null   object
 5   country       5758 non-null   object
 6   date_added    6223 non-null   object
 7   release_year  6234 non-null   int32 
 8   rating        6224 non-null   object
 9   duration      6234 non-null   object
 10  listed_in     6234 non-null   object
 11  description   6234 non-null   object
dtypes: int32(1), int64(1), object(10)
memory usage: 560.2+ KB

As you can see, the 'show_id' column is currently read as an integer. However, it is important to note that this column serves as a unique identifier for each show and should not be treated as a numerical value for statistical analysis.

In [23]:
pandas_data['show_id'] = pandas_data.show_id.astype('object')

Converting the unique identifier, such as the 'show_id' column, to an object type ensures that it will not be included in any statistical calculations, preserving the integrity of the data and avoiding any inaccuracies in the results. This approach is similar to how it is handled in Spark, where the 'show_id' is treated as an object, ensuring that it is not included in any statistics.
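The effect of the cast can be seen on a toy frame (illustrative data, not the Netflix set): describe() summarizes numeric columns by default, so an object-typed identifier drops out of the statistics.

```python
import pandas as pd

df = pd.DataFrame({
    "show_id": [1, 2, 3],
    "release_year": [2019, 2020, 2021],
})

# While show_id is an integer it appears in numeric summaries
before = df.describe().columns.tolist()
print(before)   # ['show_id', 'release_year']

# Cast to object and it disappears from the numeric statistics
df["show_id"] = df["show_id"].astype("object")
after = df.describe().columns.tolist()
print(after)    # ['release_year']
```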

Netflix Data Null Values¶

Spark Nulls¶

The select method is used in Spark to select specific columns or to perform operations on the columns and create new columns. In this case, it is used to perform a count of the null values in each column of the DataFrame spark_data by using the count function and the when expression to count only the null values. The alias method is used to give the new column the same name as the original column.

Converting the resulting DataFrame to a Pandas DataFrame using the toPandas() method has the advantage of giving access to Pandas' more convenient manipulation tools, such as the sort_values and round methods. This makes it easy to sort the DataFrame by the total number of null values in descending order and round the percentage values. Also, with Pandas you can use many plotting libraries, such as matplotlib and seaborn, that do not operate directly on PySpark DataFrames.

It is worth noting that converting the DataFrame to Pandas is not always the best option when working with large datasets, as it can cause performance issues due to the need to transfer the data from the Spark cluster to the local machine. However, if the dataset is small enough to fit in memory, converting to Pandas can be a useful tool for data manipulation and analysis.

In [24]:
spark_data_nulls = spark_data.select([count(when(col(c).isNull(), c)).alias(c) for c in spark_data.columns])
spark_nulls = spark_data_nulls.toPandas()
missing_spark_df = pd.DataFrame()
missing_spark_df['category'] = spark_nulls.columns
missing_spark_df['total'] = spark_nulls.iloc[0].values
missing_spark_df['percentage'] = missing_spark_df['total']/ spark_data.count()
missing_spark_df = missing_spark_df.sort_values('total', ascending=0).round(4)

Converting values to a tidy table: Why it matters?¶

Converting the total counts and percentages of null values in a DataFrame to a separate DataFrame is convenient for several reasons. One of the main benefits is the ability to easily visualize the data. By having the null values data in a separate DataFrame, it makes it simpler to create graphs and charts that show the distribution of null values across the columns. This can be useful for identifying patterns in the data and identifying which columns have a high number of null values, which can be useful in the data cleaning process.

Another advantage is that it facilitates data analysis. By having the null values data in a separate DataFrame, it makes it easier to analyze the data and make decisions about how to handle missing values. For example, you can use the percentage of null values to decide whether to drop a column, impute the missing values, or use some other strategy.

Additionally, having the null values data in a separate DataFrame also makes it easier to compare different datasets. You can compare the null values of two datasets to see how they differ and identify if one dataset is missing more data than the other.

In [25]:
opt.paging=False
display(missing_spark_df)

Data nulls results¶

Based on the results of the null values DataFrame generated with Spark, it appears that several columns have a significant number of missing values. The 'director', 'cast', 'country', and 'date_added' columns are each missing more than 7% of their values, and 'rating' also contains a small number of nulls.

This information can be used to make decisions about how to handle these missing values. There are several options for dealing with missing data, including dropping the columns, imputing the missing values, or using some other strategy.
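A minimal sketch of those options on a toy frame (the column names echo the Netflix data, but the values are invented):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "director": ["A", np.nan, "C", np.nan],
    "country": ["US", "UK", np.nan, "US"],
})

# Strategy 1: drop every row containing any missing value
print(len(df.dropna()))               # 1

# Strategy 2: impute with a sentinel category
filled = df.fillna({"director": "Unknown", "country": "Unknown"})
print(filled.isna().sum().sum())      # 0

# Strategy 3: impute with each column's most frequent value (mode)
mode_filled = df.fillna(df.mode().iloc[0])
print(mode_filled.isna().sum().sum()) # 0
```

Dropping is simplest but can discard most of the data (here 3 of 4 rows); sentinel or mode imputation keeps the rows at the cost of introducing assumptions about the missing values.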

In [26]:
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot 
init_notebook_mode(connected=False)
import plotly.express as px

scale = 3/4

reds = ['salmon','darksalmon','red','crimson','darkred']
reds2 = ['red', 'crimson', 'darkred']
item1 = px.bar(missing_spark_df, y='category', x='total', orientation='h',color= 'category', color_discrete_sequence=['red'], template='plotly_dark', height=625)

def update_style(item, title,size, h, w):
    fig = item.update_layout(height=h, width= w, title={'text': title, 'x':0.5, 'xanchor':'center', 'yanchor': 'top', 'font':{'size':size, 'color':'red', 'family':'Balto'}}, xaxis_showgrid=False, yaxis_showgrid=False, #plot_bgcolor='rgba(0, 0, 0, 0)', paper_bgcolor='rgba(0, 0, 0, 0)', 
                             template= 'plotly_dark')
    return fig
item1= update_style(item1, 'Spark null values', 40*scale, 500*scale, 875*scale)
item1.show()
In [27]:
item2 = px.pie(missing_spark_df, values='total', names='category', hole=0.4, template='plotly_dark', color_discrete_sequence=reds[::-1])
item2 = update_style(item2, 'Total Null values percentage',40*scale, 500*scale, 875*scale)
item2.show()

The graph shows the total number of missing values in each column, divided into categories. The categories are: 'director', 'date_added', 'cast', 'country', 'rating', 'release_year', 'listed_in', 'show_id', 'title', 'duration', and 'type'. The bar chart displays the total number of missing values in each category, and the pie chart displays the percentage from the total of missing values in each category. The results show that the column with the most missing values is 'director' with 1971 missing values, followed by 'date_added' with 661 missing values. The least number of missing values is in the 'show_id' and 'title' columns with only 2 missing values each. The pie chart displays that the 'director' column has the largest percentage of missing values at 53.1% from the total.

In [28]:
import matplotlib.pyplot as plt
from matplotlib.colors import LinearSegmentedColormap
from wordcloud import WordCloud

cmap = LinearSegmentedColormap.from_list("", ['#f00','#7a1b0c'])
spark_words = [(w, f) for w,f in zip(missing_spark_df.category, missing_spark_df.total)]
wordcloud = WordCloud(width = 1440, height = 190,
                background_color ='black',
                mode = "RGBA",
                colormap = cmap,
                min_font_size = 9,
                relative_scaling = 0.01,
                #stopwords = STOPWORDS,
                )
wordcloud.generate_from_frequencies(dict(spark_words))
# show words cloud
plt.figure(figsize = (10, 10), dpi= 300, facecolor = None)
plt.imshow(wordcloud)
plt.margins(x=0, y=0)
plt.axis("off")
plt.show()

Pandas Nulls¶

When we are talking about pandas, we can use the isna() method, which checks for missing values in a DataFrame or Series. It returns a Boolean mask indicating whether each element is missing (True) or not (False). Chained with the sum() method, it returns the number of missing values in each column.

In [29]:
bool_na = pandas_data.isna().sum() > 0
total = pandas_data.isna().sum()[bool_na]
missing_pandas_df= pd.DataFrame()
missing_pandas_df['category'] = total.index
missing_pandas_df['total'] = total.values
missing_pandas_df['percentage'] = total.values/ pandas_data.shape[0]
missing_pandas_df = missing_pandas_df.round(4)

Creating a "tidy" table is an important step when working with data in order to have greater efficiency when generating interactive charts.

In [30]:
display(missing_pandas_df)

From these results, it can be concluded that the columns 'director', 'cast' and 'country' have the highest number of missing values, while 'date_added' and 'rating' have the lowest. Overall, the dataset has a moderate amount of missing values.

In [31]:
item3 = px.bar(missing_pandas_df, y='category', x='total', orientation='h', color= 'category', color_discrete_sequence=['red'], template='plotly_dark')
item3 = update_style(item3, 'Pandas null values', 40*scale, 500*scale, 875*scale)
item3.show()
In [32]:
item4 = px.pie(missing_pandas_df, values='total', names='category', hole=.4, opacity=.7, color_discrete_sequence= reds[::-1], template='plotly_dark')
item4 = update_style(item4, 'Total Null values percentage', 40*scale, 500*scale, 875*scale)
item4.show()

The comparison of the total missing values between Spark and Pandas shows that Pandas has fewer missing values in the columns "director", "cast", "country", and "date_added". The total number of missing values in the "director" column is 1969 in Pandas and 1971 in Spark. The total number of missing values in the "cast" column is 570 in Pandas and 571 in Spark. The total number of missing values in the "country" column is 476 in Pandas and 478 in Spark. The most striking result is the "date_added" column, which has 11 missing values in Pandas and 661 in Spark.

In [33]:
cmap = LinearSegmentedColormap.from_list("", ['#f00','#7a1b0c'])
pandas_words = [(w, f) for w,f in zip(missing_pandas_df.category, missing_pandas_df.total)]
wordcloud = WordCloud(width = 1440, height = 190,
                background_color ='black',
                mode = "RGBA",
                colormap = cmap,
                min_font_size = 1,
                relative_scaling = 0.03,
                #stopwords = STOPWORDS,
                )
wordcloud.generate_from_frequencies(dict(pandas_words))
# show words cloud
plt.figure(figsize = (10, 10), dpi= 300, facecolor = None)
plt.imshow(wordcloud)
plt.margins(x=0, y=0)
plt.axis("off")
plt.show()

Nulls Data Merge¶

The merge function in pandas can be used to compare and combine the missing values identified by Spark and pandas. The function links rows of two dataframes based on one or more common columns, known as the "key" columns. It is possible to specify the type of join to be performed using the "how" parameter, which can take values such as "left", "right", "outer", or "inner".

Additionally, the merge function can handle duplicate column names by appending a suffix to the column names of one of the dataframes. This can be done using the "suffixes" parameter, which takes a tuple of two strings to be appended to the column names of the left and right dataframes, respectively.
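A toy sketch of this merge, using a small subset of the null counts reported earlier in the notebook (director: 1971 Spark / 1969 Pandas; cast: 571 Spark; rating: 10 Pandas):

```python
import pandas as pd

left = pd.DataFrame({"category": ["director", "cast"], "total": [1971, 571]})
right = pd.DataFrame({"category": ["director", "rating"], "total": [1969, 10]})

# Outer join keeps categories found in either frame; suffixes
# disambiguate the duplicated 'total' column
merged = left.merge(right, on="category", how="outer",
                    suffixes=("_spark", "_pandas")).fillna(0)
print(merged.columns.tolist())  # ['category', 'total_spark', 'total_pandas']
print(len(merged))              # 3
```

The outer join plus fillna(0) is what lets a category missing from one side (here 'cast' and 'rating') still appear in the comparison with a count of zero.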

In [40]:
missing_spark_pandas = missing_spark_df.merge(missing_pandas_df, on='category', how='outer', suffixes=('_spark', '_pandas')).fillna(0)
missing_spark_pandas['difference'] = missing_spark_pandas.total_spark - missing_spark_pandas.total_pandas
missing_spark_pandas['total_cum_spark'] = missing_spark_pandas.total_spark.cumsum()
missing_spark_pandas['total_cum_pandas'] = missing_spark_pandas.total_pandas.cumsum()
missing_spark_pandas['diff_cum'] = missing_spark_pandas.difference.cumsum()
display(missing_spark_pandas)

For each category in the dataset, the number of missing values and the percentage of missing values are provided for both Spark and Pandas. The difference between the two is also calculated. The cumulative total and cumulative difference between Spark and Pandas is also calculated.

Based on the results, it appears that there are differences in the number of missing values found by Spark and Pandas for several categories. For example, for the "director" category, Spark found 2 more missing values compared to Pandas. For the "date_added" category, Pandas found 650 fewer missing values than Spark. These differences may indicate that the two libraries handle missing data differently.

So, we are going to work with the data imported using pandas.

The reasons are:

Advantages:

  • Lower number of missing values - Pandas has fewer missing values in most of the categories compared to Spark.
  • Lower number of columns with missing values - Pandas has fewer columns with missing values compared to Spark.

Exceptions:

  • It is important to note that the choice of the best tool for data import depends on various factors such as the size of the dataset, computational resources, and the required processing power. In this case, Pandas appears to be the better choice for the given results.
In [41]:
import plotly.io as pio
pio.templates
Out[41]:
Templates configuration
-----------------------
    Default template: 'plotly'
    Available templates:
        ['ggplot2', 'seaborn', 'simple_white', 'plotly',
         'plotly_white', 'plotly_dark', 'presentation', 'xgridoff',
         'ygridoff', 'gridon', 'none']
In [42]:
item5 = px.bar(missing_spark_pandas, x='category', y=['total_spark', 'total_pandas', 'difference'], orientation='v', barmode='group', color_discrete_sequence=reds2, template='plotly_dark')
item5 = update_style(item5, 'Total Null Values Comparison', 40*scale, 450,720)
item5.show()
In [43]:
item6 = px.line(missing_spark_pandas, x='category', y=['total_spark', 'total_pandas','difference'], color_discrete_sequence=reds2, template='presentation')
item6 = update_style(item6, 'Total Null Values Comparison', 40*scale, 450,720)
item6.show()

Comparing Pandas and Spark Data Processing: Key Findings and Insights¶

The bar and line charts compare the number of missing values per column between the Spark and Pandas imports. The "date_added" column stands out with a difference of 650: Spark reports 661 missing values versus 11 for Pandas. This suggests that, in this case, Pandas parsed the date format more robustly than Spark, which failed to process many of the date values correctly.
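One plausible cause of such a gap (hypothetical here, not verified against this dataset) is that a strict, fixed-format parser returns null for values that a lenient parser accepts, for example strings with stray leading whitespace:

```python
import pandas as pd

# Illustrative values only; the middle one has a leading space
raw = pd.Series(["September 9, 2019", " September 9, 2019", " January 1, 2020"])

# Strict parsing, analogous to a fixed-format parser: anything that
# deviates from the exact pattern becomes NaT (a null)
strict = pd.to_datetime(raw, format="%B %d, %Y", errors="coerce")

# Lenient parsing: stripping whitespace first recovers all three values
lenient = pd.to_datetime(raw.str.strip(), format="%B %d, %Y", errors="coerce")
```

Under this sketch, the strict pass produces two nulls where the lenient pass produces none, mirroring the kind of discrepancy seen between the two imports.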

In [44]:
item7 = px.line(missing_spark_pandas, x='category', y=['total_cum_spark', 'total_cum_pandas'], color_discrete_sequence=reds2, template='presentation')
item7 = update_style(item7, 'Total Cumulative Null Values Comparison', 40*scale, 450, 1440*scale)
item7.update_layout(template='presentation')
item7.show()

The line graph shows that the data imported through Spark has a higher cumulative total of 3713 null values, compared to 3036 for the data imported through Pandas. The gap is largely attributable to the additional null values Spark reports, most notably in the "date_added" column, and the chart makes this disparity easy to see.
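The cumulative curves are just running sums of the per-column totals. A minimal sketch with illustrative numbers (not the actual dataset counts):

```python
import pandas as pd

# Toy per-column null counts; column names follow the notebook's frame
missing = pd.DataFrame({
    "category": ["director", "cast", "date_added"],
    "total_spark": [4, 2, 661],
    "total_pandas": [2, 2, 11],
})

# Running totals, as plotted in the cumulative comparison chart
missing["total_cum_spark"] = missing["total_spark"].cumsum()
missing["total_cum_pandas"] = missing["total_pandas"].cumsum()
```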

In [45]:
word = list(missing_spark_df.category.values) + list(missing_pandas_df.category.values)
freq = list(missing_spark_df.total.values) + list(missing_pandas_df.total.values)
cmap = LinearSegmentedColormap.from_list("", ['#f00', '#7a1b0c'])
words = [(w, f) for w, f in zip(word, freq)]
wordcloud = WordCloud(width=1440, height=249,
                      background_color='black',
                      mode="RGBA",
                      colormap=cmap,
                      min_font_size=1,
                      relative_scaling=0.03,
                      # stopwords=STOPWORDS,
                      )
wordcloud.generate_from_frequencies(dict(words))
# show words cloud
plt.figure(figsize = (10, 10), dpi= 300, facecolor = None)
plt.imshow(wordcloud)
plt.margins(x=0, y=0)
plt.axis("off")
plt.show()

Impute Null Values¶

In many cases, data is collected and used solely for informational purposes, such as creating dashboards or visualizations. In such scenarios, it is not necessary to perform complex imputation methods, such as machine learning, to fill in missing values. Instead, a simple approach, such as removing rows with missing values or replacing them with a central tendency measure, can be sufficient.

This approach is known as simple imputation and is suitable for datasets where the goal is to provide a general overview rather than making predictions or performing in-depth analysis. In the case of Netflix movies, for example, missing values may be present in the data for various reasons, such as data entry errors or missing information from the source.

By removing rows with missing values or replacing them with a central tendency measure, such as the mean or median, we can still provide a general overview of the data and display it in a meaningful way. This approach is also quick and easy to implement, making it a practical solution for data that is used solely for informational purposes.
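The two strategies described, placeholder or central-tendency fills and row removal, can be sketched as follows. The column names echo the Netflix dataset; the values and the numeric `duration_min` column are purely illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "director": [None, "Someone", None],
    "duration_min": [90.0, None, 120.0],  # hypothetical numeric column
    "rating": ["PG", "R", None],
})

# Categorical column: fill with an explicit placeholder
df["director"] = df["director"].fillna("no director")

# Numeric column: fill with a central-tendency measure (median here)
df["duration_min"] = df["duration_min"].fillna(df["duration_min"].median())

# Sparse nulls: drop the few affected rows instead of imputing
df = df.dropna(subset=["rating"])
```

Placeholder fills preserve every row while making the gaps explicit in the dashboard; dropping rows is reserved for columns where nulls are too rare to matter.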

In [46]:
data = pandas_data.copy()
In [47]:
item8 = px.imshow(data.isna(), color_continuous_scale= reds2[::-1], template='plotly_dark')
item8 = update_style(item8, 'Null Values Matrix', 40*scale,450, 1440*scale)
item8.show()
In [48]:
opt.paging = False
display(missing_pandas_df)
category total percentage
In [49]:
missing_types= data[missing_pandas_df.category].dtypes.reset_index()
missing_types.columns = ['category', 'data_type']
display(missing_types)
category data_type

Exploring and Cleaning¶

In [50]:
opt.paging= True
display(data[data.director.isna()])
show_id type title director cast country date_added release_year rating duration listed_in description
In [51]:
data['director'] = data['director'].fillna('no director')
display(data.director)
director

The 'director' column was identified as having null values. As a simple, straightforward fix, the missing values were replaced with the placeholder string 'no director'.

In [37]:
display(data[data.cast.isna()])
show_id type title director cast country date_added release_year rating duration listed_in description
In [52]:
data['cast'] = data['cast'].fillna('no cast')
data.cast
Out[52]:
cast

The 'cast' column was identified as having null values. As a simple, straightforward fix, the missing values were replaced with the placeholder string 'no cast'.

In [53]:
display(data[data.country.isna()])
show_id type title director cast country date_added release_year rating duration listed_in description
In [54]:
data['country'] = data['country'].fillna('no country')
In [55]:
data.country
Out[55]:
country

The 'country' column was identified as having null values. As a simple, straightforward fix, the missing values were replaced with the placeholder string 'no country'.

In [56]:
display(data[data.date_added.isna()])
show_id type title director cast country date_added release_year rating duration listed_in description

In the case of columns 'date_added' and 'rating', the null values represent a very small percentage of the total data, around 0.2%. Therefore, the most straightforward solution is to simply drop these rows and clean the data, rather than imputing or replacing these values with any other information. This approach reduces the risk of introducing inaccuracies or biases in the data, and helps to maintain the integrity of the information being presented in the dashboard. By removing the null values in these columns, the data will be more consistent and provide a clearer representation of the available information.
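The "drop when nulls are rare" decision can be checked numerically before removing anything. A minimal sketch with illustrative data and an illustrative 1% threshold:

```python
import pandas as pd

# Toy frame: 2 of 1000 rows have a null 'date_added' (0.2%)
df = pd.DataFrame({
    "date_added": ["January 1, 2020"] * 998 + [None, None],
    "rating": ["PG"] * 1000,
})

# Fraction of missing rows, expressed as a percentage
pct_missing = df["date_added"].isna().mean() * 100

if pct_missing < 1.0:  # illustrative threshold: drop only when nulls are rare
    df = df.dropna(subset=["date_added", "rating"])
```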

In [57]:
data.dropna(subset=['date_added', 'rating'], inplace=True)
data[['date_added','rating']]
Out[57]:
date_added rating

No Null values¶

In [58]:
count_missing = data.isna().sum().reset_index()
count_missing.columns = ['category', 'null values sum']
display(count_missing)
category null values sum
In [59]:
item9 = px.imshow(data.isna(), color_continuous_scale= reds2, template='plotly_dark')
item9 = update_style(item9, 'Null Values Matrix', 40*scale, 450, 1440*scale)
item9.show()
In [46]:
data.to_csv('netflix_titles_clean.csv', index=False)

The visualization confirms that the dataset no longer contains null values and is almost ready for building the dashboard. In the matrix, color encodes the presence of missing values: red marks a higher concentration of missing values, while dark red indicates few or none.

Layout¶

In [29]:
!jupyter nbconvert --to html --theme jupyterlab_miami_nights  --output Netflix_Null_Values.html Netflix_Null_Values.ipynb
[NbConvertApp] Converting notebook Netflix_Null_Values.ipynb to html
[NbConvertApp] Writing 13484878 bytes to Netflix_Null_Values.html